HALT — CERTIFICATION SUSPENDEDWhite-box self-test recall 0.71 is below the 0.90 red line. Runs are blocked until verifier recall recovers.
MOCK-LLM MODE — every model call is stubbed. No real tokens or dollars are spent. Disable in /admin/debug before any certified run.
CCrucibleoperator
/runs/r_8f3a · live run view
PER-RUN$4.18
SESSION$61.40
DEMO TOGGLES ·
CRUCIBLE · OPERATOR DASHBOARD
Design System
The canonical visual language for the adversarial-security operator dashboard. The palette is Graphite Meridian — a near-dark navy-graphite base with one restrained steel-cyan primary and AAA-tuned semantic accents, built to earn trust from a bank model-risk officer, a code-agent vendor, and a public-sector procurement officer at once. Every per-route slice references these tokens and copies these components verbatim.
IBM Plex Sans / Mono14px base · AAA bodyNo gradients · no decorative motion_palette_notes.md
01
Foundations · Color
SURFACES — graphite-navy, never pure black
base
#0E141B
surface
#161E27
surface-2
#1D2630
surface-3
#25303C
border
#2C3744
border-strong
#3A4654
TEXT — cool off-white, with measured contrast on base
text-hi
#E8EDF3 · headings & key numbers
14.5:1 · AAA
text
#B8C2CE · body copy
8.6:1 · AAA
text-mut
#7C8896 · labels & meta
4.7:1 · AA
PRIMARY & SEMANTIC — restrained, AAA-tuned on base
primary
#4FAAC0
links · brand · detection line
success
#57C08A
oracle PASS · healthy
danger
#E5736B
oracle FAIL · destructive
warning
#D9A441
amber health · ASR line
halt banner
bg #5E1A1A · text #E5B5B0
mock-LLM banner
bg #3A3413 · text #E8C84A
02
Foundations · Type
IBM PLEX SANS · UI & BODY
display / 42·700Verified
h1 / 24·600Live Run View
h2 / 16·600Oracle votes
body / 14·400The red agent rewards a held-out obligation.
label / 12·500Budget remaining
IBM PLEX MONO · CODE · TRACES · DOLLARS
obligation:held_out_tests.pass_rate >= 0.95
observed:0.82→FAIL
tokens:3,114·$0.041
// every $ amount is mono, always
Mono carries all dollar amounts, token counts, run IDs, prompts, raw responses, audit JSON, and any aligned numeric column.
03
Foundations · Spacing & Radius
SPACE SCALE · 4px base
4
8
12
16
24
32
48
RADIUS
5 · controls
7 · chips
8 · cards
full · dots
Tight radii throughout — the dashboard reads as an instrument, not a consumer app. Nothing rounder than 8px except status dots.
Six components copied verbatim into every route. Transparency-first: every one drills into underlying LLM calls, sandbox jobs, or captured seeds. Secrets (API keys, DB creds, sandbox tokens) are the only things ever hidden.
InspectButton
Magnifier on every reasoning-trace line, oracle card, and producer-output panel. Opens a right drawer with the LLM call (prompt, raw response, parsed output, tokens, dollars) or sandbox job (env, network rules, exit code, stdout, stderr).
ReplayButton
Sits next to any action with a captured seed. Opens a drawer showing original output and replay output, diffed line by line — the spine of the audit-row replayer.
AuditTraceCard
Per-oracle: obligation, observation, reasoning, pass-or-fail. Identical shape on Verdict Detail and the Live Run View. The LLM Judge renders smaller — it is "one vote."
Held-Out Tests✓ PASS
OBLIGATION
pass_rate >= 0.95
OBSERVED
0.98 across 220 sealed cases
REASONING
No held-out case regressed under the candidate patch.
Metamorphic Relations✕ FAIL
OBLIGATION
invariance == true
OBSERVED
label flipped on amount-scaled input
REASONING
Scaling the transaction 10× should not change the fraud verdict; it did.
½LLM JudgeONE VOTE ⓘSmaller weight than the four independent oracles. Hover the badge for why.✓ PASS
HealthBadge
Dot + timestamp + optional error. On /health leaves and inline anywhere a subcomponent is named.
Producer Sandboxok · 14:02:11Z
Oracle: Differentiallast-good 13:51Z · timeout
Blue: Patch TrainerCUDA OOM at step 1.2k
CostChip
Inline dollar amount, tooltip carries pillar + run ID. On every catalog row, blue patch, LLM call.
You are the held-out test oracle. Given the
producer output and the sealed obligation,
report pass_rate over the 220 held-out cases.
Cite each regressed case by id.